DATA SCIENCE SESSIONS VOL. 3¶

A Foundational Python Data Science Course¶

Session 15: Regularization in MLR. The Maximum Likelihood Estimation (MLE).¶

← Back to course webpage

Feedback should be send to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers¶

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


In [ ]:
### --- Setup - importing the libraries

# - supress those annoying 'Future Warning'
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


# - data
import numpy as np
import pandas as pd

# - os
import os

# - ml
from scipy.stats import norm
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor


# - visualization
import matplotlib.pyplot as plt
import seaborn as sns

# - sklearn
from sklearn.linear_model import LinearRegression

# - parameters
%matplotlib inline
pd.options.mode.chained_assignment = None  # default='warn'
sns.set_theme()

# - rng
rng = np.random.default_rng(seed=513)

# - plots
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set_theme(style='white')

# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')

0. The Problem of Overfitting¶

Overfitting is a common problem in machine learning where a model learns to fit the training data too closely, leading to poor performance on new or unseen data. In other words, the model becomes too complex and starts to memorize the training data instead of learning the underlying patterns and relationships.

Overfitting occurs when a model is trained on a limited set of data, and the model is too complex, with too many parameters or features, to generalize to new data. This can lead to a model that performs very well on the training data but poorly on new data.

Overfitting can be identified by comparing the model's performance on the training data to its performance on a validation set or test data. If the model performs significantly better on the training data than on the validation set, then it is likely overfitting.

To prevent overfitting, one can use techniques such as regularization, early stopping, or reducing the complexity of the model. Regularization techniques, such as L1 or L2 regularization, can penalize the model for having too many features or high parameter values, while early stopping can stop the training process before the model starts to overfit. Additionally, reducing the complexity of the model, such as by decreasing the number of features or using a simpler model architecture, can also help prevent overfitting.

In [ ]:
# - generate A as 
x = norm.rvs(loc=15, scale = 10, size = 1000000)
# - B is a noisy linear transformation of A:
error_y = norm.rvs(loc=0, scale = 4, size = 1000000)
intercept_y = 7.5
slope_y = .78
y = intercept_y + slope_y*x + error_y
data = pd.DataFrame({'x':x, 'y':y})
data
Out[ ]:
x y
0 17.633312 20.255282
1 8.831127 16.299912
2 24.547937 23.651645
3 29.206100 32.108250
4 36.844515 34.556358
... ... ...
999995 29.450137 32.518768
999996 11.005039 18.196482
999997 25.596081 33.349529
999998 19.723159 18.062661
999999 24.089208 34.025072

1000000 rows × 2 columns

In [ ]:
# create a sample dataframe

# create a scatterplot of x and y
plt.scatter(data['x'], data['y'])

# create a linear regression model
lr = LinearRegression()
lr.fit(data[['x']], data['y'])

# create a range of x values to plot the regression line
x_range = np.linspace(data['x'].min(), data['x'].max(), 100)

# predict y values using the linear regression model and x_range values
y_pred = lr.predict(x_range.reshape(-1,1))

# add the regression line to the scatterplot
plt.plot(x_range, y_pred, color='red')

# add labels and title to the plot
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatterplot of x and y with Regression Line')

# show the plot
plt.show()

# R2
R2 = lr.score(data[['x']], data['y'])
print(f'R2={R2}')
/home/ikacikac/workspace/dss03python2023/.venv/lib/python3.8/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
R2=0.7912635443084501
In [ ]:
# create an empty DataFrame to store the regression results
results_df = pd.DataFrame(columns=['Beta_0', 'Beta_1', 'R2'])

# perform 10000 linear regressions on samples of size 1000
for i in range(10000):
    # take a sample of size 100 without replacement
    sample = data.sample(n=100, replace=False)
    
    # perform linear regression on the sample
    lr = LinearRegression()
    lr.fit(sample[['x']], sample['y'])
    
    # extract the intercept, slope, and R2 from the linear regression
    beta_0 = lr.intercept_
    beta_1 = lr.coef_[0]
    r2 = lr.score(sample[['x']], sample['y'])
    
    # add the results to the results DataFrame
    results_df.loc[i] = [beta_0, beta_1, r2]

# display the results DataFrame
print(results_df)
        Beta_0    Beta_1        R2
0     7.959892  0.747901  0.724778
1     8.219678  0.728142  0.760293
2     8.951546  0.695060  0.713171
3     8.410322  0.753360  0.813525
4     7.058162  0.832473  0.827351
...        ...       ...       ...
9995  8.135709  0.724335  0.744519
9996  6.260508  0.834792  0.785136
9997  7.499823  0.751799  0.783756
9998  6.838027  0.815565  0.759536
9999  5.921213  0.839471  0.815359

[10000 rows x 3 columns]
In [ ]:
results_df.hist('R2', bins=100)
Out[ ]:
array([[<AxesSubplot: title={'center': 'R2'}>]], dtype=object)
In [ ]:
# take 12 random samples of size 1000 without replacement
samples = [data.sample(n=100, replace=False) for i in range(12)]

# create a 4 x 3 grid of subplots
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(12, 9))

# loop through each subplot and plot the scatterplot and regression line for one sample
for i, ax in enumerate(axes.flat):
    # perform linear regression on the sample
    lr = LinearRegression()
    lr.fit(samples[i][['x']], samples[i]['y'])
    
    # plot the scatterplot
    sns.scatterplot(x='x', y='y', data=samples[i], ax=ax)
    
    # plot the regression line
    sns.lineplot(x=samples[i]['x'], y=lr.predict(samples[i][['x']]), color='red', ax=ax)
    
    # set the title of the subplot to the R2 score
    ax.set_title('R2: {:.2f}'.format(lr.score(samples[i][['x']], samples[i]['y'])))
    
# set the overall title of the plot
fig.suptitle('Scatterplots with Regression Lines for 12 Samples')

# adjust the spacing between subplots
plt.tight_layout()

# show the plot
plt.show()

1. Regularization in Multiple Linear Regression¶

1.1 Lasso (L1), Ridge (L2), and Elastic Net regression¶

Regularization is used for several purposes:

  • it helps reduce the problem of multicollinearity in Linear Regression,
  • it is used as method of feature selection in large models, and
  • it is used as a method to prevent overfitting.

Consider the following:

$$Loss = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}$$

It is just the ordinary sum of squares, or the loss (or cost) function of the Linear Regression Model. The parameters $\beta$ of the Linear Model are estimate by minimizing the above quantity - the model's cost function. The expression above is just the ordinary $SSE$ function as we have encountered it before in our sessions on Linear Regression.

Now, consider the following formulation of the cost function that includes the penalty term:

$$Loss_{L2} = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}+\lambda\sum_{j=1}^{p}\beta_j^2$$

$\lambda\sum_{j=1}^{p}\beta_j^2$ is the penalty term: it increases the value of the cost function by a factor determined by the sum of the squared model coefficients. In the expression above $p$ stands for the number of model parameters (coefficients, in the case of Linear Regression) and $\beta$ are the coefficients themselves. The above expression represents the $L2$ regularization for the Linear Model, also known as Ridge Regression. If $\lambda$ is zero we get back to the Ordinary Least Squares model that we have already learned about. If $\lambda$ is too large, it will add too much weight to the cost function which will lead to underfitting.

The underlying idea is to penalize large coefficients: it regularizes the coefficients so that if they take large values the cost function is penalized. In effect, the Ridge regression shrinks the coefficients so to reduce the model complexity and reduces the effects multicollinearity.

Consider now the following modification of the cost function:

$$Loss_{L1} = \sum_{i=1}^{n}{\big(y_i-\hat{y_i}\big)^2}+\lambda\sum_{j=1}^{p}|\beta_j|$$

The above cost function is based on a weighted sum of the absolute values of the model coefficients. This approach is known as $L1$ or Lasso Regression.

The most important difference between Ridge and Lasso regularization is that in Lasso some coefficients can be shrinked to zero and get completely eliminated from the model. Thus, Lasso regularization does not only prevent overfitting but also performs feature selection - a very handy thing when considering large linear models.

Finally, consider the following formulation of a loss (or cost) function:

$$Loss_{ElasticNet} = \frac{\sum_{i=1}^{n}(y_i-\hat{y_i})^2}{2n}+\lambda(\frac{1-\alpha}{2}\sum_{j=1}^{m}\hat{\beta_j}^2+\alpha\sum_{j=1}^{m}|\hat{\beta_j}|)$$

In the expression above - the Elastic Net regression - $\alpha$ is the mixing parameter between Lasso ($\alpha=1$) and Ridge ($\alpha=0$) approach. While the Ridge regression shrinks the coefficients of correlated predictors towards each other, the Lasso regression tends to pick one of them and decrease the others. The Elastic Net penalty is a mixture of the two approaches controlled by $\alpha$.

1.2 Inspect Multicolinearity in one particular Linear Regression problem¶

In [ ]:
data_set = pd.read_csv(os.path.join(data_dir, 'kc_house_data.csv'))
data_set.head(10)
Out[ ]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503
5 7237550310 20140512T000000 1225000.0 4 4.50 5420 101930 1.0 0 0 ... 11 3890 1530 2001 0 98053 47.6561 -122.005 4760 101930
6 1321400060 20140627T000000 257500.0 3 2.25 1715 6819 2.0 0 0 ... 7 1715 0 1995 0 98003 47.3097 -122.327 2238 6819
7 2008000270 20150115T000000 291850.0 3 1.50 1060 9711 1.0 0 0 ... 7 1060 0 1963 0 98198 47.4095 -122.315 1650 9711
8 2414600126 20150415T000000 229500.0 3 1.00 1780 7470 1.0 0 0 ... 7 1050 730 1960 0 98146 47.5123 -122.337 1780 8113
9 3793500160 20150312T000000 323000.0 3 2.50 1890 6560 2.0 0 0 ... 7 1890 0 2003 0 98038 47.3684 -122.031 2390 7570

10 rows × 21 columns

In [ ]:
data_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB

Target: predict price from the numerical predictors¶

In [ ]:
# - for the simplicity we will drop some variables from the dataset
model_frame = data_set.drop(columns = ['id', 'lat', 'long', 'date',
                                      'zipcode', 'yr_renovated', 'waterfront',
                                      'view', 'sqft_basement'])
model_frame.head()
Out[ ]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_above yr_built sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 3 7 1180 1955 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 3 7 2170 1951 1690 7639
2 180000.0 2 1.00 770 10000 1.0 3 6 770 1933 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 5 7 1050 1965 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 3 8 1680 1987 1800 7503
In [ ]:
predictors = model_frame.columns.drop('price')
predictors
Out[ ]:
Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'condition', 'grade', 'sqft_above', 'yr_built', 'sqft_living15',
       'sqft_lot15'],
      dtype='object')
In [ ]:
# - the correlations of the predictors
model_frame[predictors].corr()
Out[ ]:
bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_above yr_built sqft_living15 sqft_lot15
bedrooms 1.000000 0.515884 0.576671 0.031703 0.175429 0.028472 0.356967 0.477600 0.154178 0.391638 0.029244
bathrooms 0.515884 1.000000 0.754665 0.087740 0.500653 -0.124982 0.664983 0.685342 0.506019 0.568634 0.087175
sqft_living 0.576671 0.754665 1.000000 0.172826 0.353949 -0.058753 0.762704 0.876597 0.318049 0.756420 0.183286
sqft_lot 0.031703 0.087740 0.172826 1.000000 -0.005201 -0.008958 0.113621 0.183512 0.053080 0.144608 0.718557
floors 0.175429 0.500653 0.353949 -0.005201 1.000000 -0.263768 0.458183 0.523885 0.489319 0.279885 -0.011269
condition 0.028472 -0.124982 -0.058753 -0.008958 -0.263768 1.000000 -0.144674 -0.158214 -0.361417 -0.092824 -0.003406
grade 0.356967 0.664983 0.762704 0.113621 0.458183 -0.144674 1.000000 0.755923 0.446963 0.713202 0.119248
sqft_above 0.477600 0.685342 0.876597 0.183512 0.523885 -0.158214 0.755923 1.000000 0.423898 0.731870 0.194050
yr_built 0.154178 0.506019 0.318049 0.053080 0.489319 -0.361417 0.446963 0.423898 1.000000 0.326229 0.070958
sqft_living15 0.391638 0.568634 0.756420 0.144608 0.279885 -0.092824 0.713202 0.731870 0.326229 1.000000 0.183192
sqft_lot15 0.029244 0.087175 0.183286 0.718557 -0.011269 -0.003406 0.119248 0.194050 0.070958 0.183192 1.000000
In [ ]:
# --- Composing the fomula of the model

# - right side of the formula
formula = ' + '.join(predictors)

# - left side of the formula
formula = 'price ~ ' + formula

formula
Out[ ]:
'price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + condition + grade + sqft_above + yr_built + sqft_living15 + sqft_lot15'
In [ ]:
# - fitting the linear model to the data
mlr = smf.ols(formula=formula, data=model_frame).fit()
mlr.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: price R-squared: 0.620
Model: OLS Adj. R-squared: 0.620
Method: Least Squares F-statistic: 3202.
Date: Tue, 25 Apr 2023 Prob (F-statistic): 0.00
Time: 16:54:16 Log-Likelihood: -2.9715e+05
No. Observations: 21613 AIC: 5.943e+05
Df Residuals: 21601 BIC: 5.944e+05
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.933e+06 1.36e+05 51.002 0.000 6.67e+06 7.2e+06
bedrooms -4.936e+04 2108.805 -23.406 0.000 -5.35e+04 -4.52e+04
bathrooms 5.01e+04 3625.970 13.817 0.000 4.3e+04 5.72e+04
sqft_living 195.1515 4.812 40.552 0.000 185.719 204.584
sqft_lot 0.0111 0.054 0.207 0.836 -0.094 0.116
floors 3.348e+04 3957.258 8.461 0.000 2.57e+04 4.12e+04
condition 1.898e+04 2580.038 7.356 0.000 1.39e+04 2.4e+04
grade 1.246e+05 2346.774 53.099 0.000 1.2e+05 1.29e+05
sqft_above -28.6643 4.678 -6.128 0.000 -37.833 -19.496
yr_built -3968.3906 69.774 -56.875 0.000 -4105.152 -3831.629
sqft_living15 36.2656 3.735 9.709 0.000 28.944 43.587
sqft_lot15 -0.5066 0.082 -6.172 0.000 -0.667 -0.346
Omnibus: 17771.672 Durbin-Watson: 1.984
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1362466.098
Skew: 3.467 Prob(JB): 0.00
Kurtosis: 41.274 Cond. No. 4.47e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.47e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [ ]:
# - R^2 of the linear model
mlr.rsquared
Out[ ]:
0.6198708417378094
In [ ]:
# - computing VIFs
model_frame_predictors = sm.add_constant(model_frame[predictors])
vifs = [variance_inflation_factor(model_frame_predictors.values, i) for i in range(1, len(predictors)+1)]
vifs = np.array(vifs).reshape(1, -1)
vifs
pd.DataFrame(vifs, columns=predictors)
Out[ ]:
bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_above yr_built sqft_living15 sqft_lot15
0 1.621833 3.287936 8.236167 2.085978 1.925165 1.188449 3.208241 6.325912 1.770933 2.763124 2.117307

1.3 Regularization of the model coefficients: Ridge (L2) Regularization¶

Now we are going to use scikit-learn to perform Regularized Multiple Linear Regression; we'll be using the Ridge Regression.

As we have see, the Ridge Regression model is obtained by minimizing the function

$$SSE + \alpha L_2^2,$$

where $SSE$ is the Sum Squarred Error of the 'ordinary' Multiple Linear Regression

$$\hat{y} = \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k + \beta_0,$$

and $L_{2}^2$ is the squared $L_{2}-$ norm of Multiple Linear Regression model parameters $\beta_1,\beta_2,\ldots,\beta_k$:

$$L_2^2 = \beta_1^2 + \beta_2^2 + \cdots + \beta_k^2,$$

and hence the name L2 Regularization. $\alpha$ is called penalization (hyper)parameter, and it's used to control the magnitude of models parameters; for $\alpha = 0$ we recover ordinary MLR.

In [ ]:
from sklearn import linear_model
In [ ]:
model_frame.head()
Out[ ]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_above yr_built sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 3 7 1180 1955 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 3 7 2170 1951 1690 7639
2 180000.0 2 1.00 770 10000 1.0 3 6 770 1933 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 5 7 1050 1965 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 3 8 1680 1987 1800 7503
In [ ]:
# - checking if all the variables are numerical
model_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          21613 non-null  float64
 1   bedrooms       21613 non-null  int64  
 2   bathrooms      21613 non-null  float64
 3   sqft_living    21613 non-null  int64  
 4   sqft_lot       21613 non-null  int64  
 5   floors         21613 non-null  float64
 6   condition      21613 non-null  int64  
 7   grade          21613 non-null  int64  
 8   sqft_above     21613 non-null  int64  
 9   yr_built       21613 non-null  int64  
 10  sqft_living15  21613 non-null  int64  
 11  sqft_lot15     21613 non-null  int64  
dtypes: float64(3), int64(9)
memory usage: 2.0 MB
In [ ]:
# - feature matrix
X = model_frame[predictors].values

# - target vector
y = model_frame['price'].values
In [ ]:
### --- Fitting a Ridge MLR Regularized model on the given data, with the penalization parameter \alpha = 10^4
mlr_ridge = linear_model.Ridge(alpha=1e+4)
mlr_ridge.fit(X, y)
Out[ ]:
Ridge(alpha=10000.0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Ridge(alpha=10000.0)
In [ ]:
# - model intercept
mlr_ridge.intercept_
Out[ ]:
5945635.344976561
In [ ]:
# - model coefficients; the order corresponds to the order of coefficients given by the statsmodel 'ordinary' MLR
mlr_ridge.coef_
Out[ ]:
array([-2.91771401e+04,  1.81662857e+04,  2.26257901e+02,  1.41002993e-02,
        1.59074953e+04,  7.12968872e+03,  6.53923517e+04, -5.50894014e+00,
       -3.28129969e+03,  6.05780655e+01, -6.44587129e-01])
In [ ]:
# - R^2 score of the Ridge Regularization
mlr_ridge.score(X, y)
Out[ ]:
0.6013062798111166
In [ ]:
### --- Varying the penalization parameter for the Ridge Regularized Model
n_alphas = 200
alphas = np.logspace(-2, 10, n_alphas)

coefs = []
scores = []
mserrs = []
l2_norms = []
for a in alphas:
    # - fitting a Ridge Regularization
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(X, y)
    
    # - R^2 score of a fitted Regularized Model
    R2 = ridge.score(X, y)
    
    # - MSE of a model
    mse = np.mean((y - ridge.predict(X))**2)
    
    # - squared L2 norm of model coefficients
    l2 = np.linalg.norm(ridge.coef_, ord=2)**2
    
    coefs.append(ridge.coef_)
    scores.append(R2)
    mserrs.append(mse)
    l2_norms.append(l2)
In [ ]:
### --- Plotting the dependence of model parameters vs. the penalization parameter
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
ax.axvline(x=1, c='r', ls='--') # red line, corresponding to \alpha = 1
ax.axvline(x=10000, c='g', ls='--') # green line, corresponding to \alpha = 10^4
plt.xlabel("alpha")
plt.ylabel("coefficient")
plt.title("Ridge coefficients as a function of the regularization")
plt.axis("tight")
plt.legend(predictors)
plt.show()
In [ ]:
### --- Plotting the dependance of various model metrics vs. the penalization hyperparameter
ig, ax = plt.subplots(1, 3, figsize=(24, 6), sharex=True)


# - squared L2-norm vs. \alpha
ax[0].plot(alphas, l2_norms)
ax[0].set_xscale("log")
ax[0].set_xlim(ax[0].get_xlim()[::-1])  # reverse axis
ax[0].axvline(x=1, c='r', ls='--') # red line
ax[0].axvline(x=10000, c='g', ls='--') # green line
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("$L_2$ norm$^2$")
ax[0].set_title("$L_2$ norm$^2$ as a function of the regularization")
ax[0].axis("tight")

# - MSE vs. \alpha
ax[1].plot(alphas, mserrs)
ax[1].set_xscale("log")
ax[1].axvline(x=1, c='r', ls='--') # red line
ax[1].axvline(x=10000, c='g', ls='--') # green line
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("MSE")
ax[1].set_title("MSE as a function of the regularization")
ax[1].axis("tight")

# - R^2 vs. \alpha
ax[2].plot(alphas, scores)
ax[2].set_xscale("log")
ax[2].axvline(x=1, c='r', ls='--') # red line
ax[2].axvline(x=10000, c='g', ls='--') # green line
ax[2].set_xlabel("alpha")
ax[2].set_ylabel("$R^2$")
ax[2].set_title("$R^2$ score as a function of the regularization")
ax[2].axis("tight");

We see that the best model is obtained with small penalization hyperparameter, i.e. with large model parameter (coefficient) values.

1.4 LASSO (L1) Regularization¶

LASSO is another way to regularize MLR model. It's obtained by minimizing the function

$$SSE + \alpha L_1,$$

where $SSE$ is Sum Squared Error of the ordinary Multiple Linear Regression and $L_1$ is $L_1-$ norm of its coeffitents given by

$$L_1 = |\beta_1| + |\beta_2| + \cdots + |\beta_k|,$$

and hence the name L1 Regularization. $\alpha$ is, of course, penalization hyperparameter.

In [ ]:
### --- Varying the penalization parameter for the LASSO Regularized Model
n_alphas = 200
alphas = np.logspace(-2, 10, n_alphas)

coefs = []
scores = []
mserrs = []
l1_norms = []
for a in alphas:
    # - fitting a LASSO Regularization
    lasso = linear_model.Lasso(alpha=a)
    lasso.fit(X, y)
    
    # - R^2 score of a fitted Regularized Model
    R2 = lasso.score(X, y)
    
    # - MSE of a model
    mse = np.mean((y - lasso.predict(X))**2)
    
    # - L1 norm of model coefficients
    l1 = np.linalg.norm(lasso.coef_, ord=1)
    
    coefs.append(lasso.coef_)
    scores.append(R2)
    mserrs.append(mse)
    l1_norms.append(l1)
In [ ]:
### --- Plotting the dependence of model parameters vs. the penalization parameter
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis, 
ax.axvline(x=1, c='r', ls='--') # red line, corresponding to \alpha = 1
plt.xlabel("alpha")
plt.ylabel("weights")
plt.title("LASSO coefficients as a function of the regularization")
plt.axis("tight")
plt.legend(predictors)
plt.show()
In [ ]:
### --- Plotting the dependance of various model metrics vs. the penalization hyperparameter
ig, ax = plt.subplots(1, 3, figsize=(24, 6), sharex=True)

# - L1-norm vs. \alpha
ax[0].plot(alphas, l1_norms)
ax[0].set_xscale("log")
ax[0].set_xlim(ax[0].get_xlim()[::-1])  # reverse axis
ax[0].axvline(x=1, c='r', ls='--') # red line
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("L1 norm")
ax[0].set_title("L1 norm as a function of the regularization")
ax[0].axis("tight")

# - MSE vs. \alpha
ax[1].plot(alphas, mserrs)
ax[1].set_xscale("log")
ax[1].axvline(x=1, c='r', ls='--') # red line
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("MSE")
ax[1].set_title("MSE as a function of the regularization")
ax[1].axis("tight")

# - R^2 vs. \alpha
ax[2].plot(alphas, scores)
ax[2].set_xscale("log")
ax[2].axvline(x=1, c='r', ls='--') # red line
ax[2].set_xlabel("alpha")
ax[2].set_ylabel("$R^2$")
ax[2].set_title("$R^2$ score as a function of the regularization")
ax[2].axis("tight");

We see that the best model is obtained with small penalization hyperparameter, i.e. with big model parameters values.

2. Maximum Likelihood Estimator (MLE)¶

Maximum Likelihood Estimation (MLE) is yet another approach in estimating model parameters. It is a way of estimating the parameter(s) of probability distribution that generated the sample that we have at hand. Before showing the examples of how parameters values are estimated using MLE, let's refresh some of the things we talked about during the session 11.

Back then, we have introduced something that is called Bayes' Theorem. Given that we have two events $A$ and $B$, and given that $P(A)\neq0$ we can define:

$$ P(B|A) = \frac{P(A|B)P(B)}{P(A)}.$$

Also, we said that $P(A|B)$ is also called likelihood, and introduced another way of naming it. Namely $\mathcal{L}(B|A)$. If you remember we said that:

$$\mathcal{L}(B|A) = P(A|B).$$

So, the likelihood of the event $B$ occuring given that the event $A$ has occured is equal to the conditional probability of event $A$ occuring under the condition that the event $B$ has occured.

If we speak in terms of random varable and distribution parameters we usually write:

$$P(\theta; X) = \frac{\mathcal{L}(\theta; X)P(\theta)}{P(X)}$$

and

$$\mathcal{L}(\theta; X) = P(X; \theta).$$

where $X$ is RV and $\theta$ are its distributions' parameters (when there is a single parameter $\theta$ is scallar, and when there are multiple parameters ${\theta}$ is a vector).

Now, imagine we have RV $X$ sample of 3 observations $x_1, x_2, x_3$. In this case, previously rewritten Bayes' Theorem would be:

$$P(\theta; x1, x2, x3) = \frac{\mathcal{L}(\theta; x_1, x_2, x_3)P(\theta)}{P(x_1, x_2, x_3)}.$$

If we focus just on $\mathcal{L}(\theta; x_1, x_2, x_3)$ term, and use already defined equality $\mathcal{L}(\theta; X) = P(X; \theta)$, knowing that $x_1, x_2$ and $x_3$ are statistically independent, we have:

$$\mathcal{L}(\theta; x_1, x_2, x_3) = P(x_1, x_2, x_3; \theta) = P(x_1; \theta) * P(x_2; \theta) * P(x_3; \theta). $$

In general, if we have the sample of N observations, we have:

$$\mathcal{L}(\theta; x_1, x_2,..., x_N) = \prod_{i=1}^{N}{P(x_i; \theta)}. $$

This multiplication can easily get computationally intractable in practice. This is the reason why we define log-likelihood as:

$$\mathcal{l}(\theta; x_1, x_2,..., x_N) = log(\prod_{i=1}^{N}{P(x_i; \theta)}) = \sum_{i=1}^{N}{log(P(x_i; \theta))}.$$

This works because the logarithm is a strictly increasing function, and by maximizing the log-likelihood we are in essence maximizing likelihood.

And now, the examples.

Tossing a coin¶

Imagine I have a fair coin: the one that lands Heads and Tails with equal probabilities P(H) = P(T) = 1/2.

In [ ]:
tosses = np.array([1,0,1,0,1,0,0,1,0,0])
print(tosses)
[1 0 1 0 1 0 0 1 0 0]
In [ ]:
np.unique(tosses, return_counts=True)[1]
Out[ ]:
array([6, 4])

Uhm, but I have already assumed that we will be tossing a fair coin? Four heads, six tails, how come? Is this really a fair coin? The answer is: it still might be. Let's see.

In [ ]:
outcomes = np.array([0,1])
tosses = np.random.choice(outcomes, 10, replace=True, p=[.5, .5])
print(tosses)
print(np.unique(tosses, return_counts=True)[1]/tosses.size)
[1 1 1 1 0 0 1 1 1 0]
[0.3 0.7]

What about now?

In [ ]:
outcomes = np.array([0,1])
tosses = np.random.choice(outcomes, 10, replace=True, p=[.5, .5])
print(tosses)
print(np.unique(tosses, return_counts=True)[1]/tosses.size)
[0 1 0 0 0 0 0 0 1 0]
[0.8 0.2]

Let's create 1000 samples and check the proportions of heas and tails.

In [ ]:
outcomes = np.array([0,1])
num_exps = 1_000
stat_exp_1 = []

for i in range(num_exps):
    tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])
    result = np.array([(tosses==0).sum(), (tosses==1).sum()])/tosses.size
    result = {'P_T':result[0], 'P_H':result[1]}
    stat_exp_1.append(result)
    
stat_exp_1 = pd.DataFrame(stat_exp_1)    
stat_exp_1.head(20)
Out[ ]:
P_T P_H
0 0.55 0.45
1 0.53 0.47
2 0.47 0.53
3 0.46 0.54
4 0.59 0.41
5 0.55 0.45
6 0.46 0.54
7 0.40 0.60
8 0.55 0.45
9 0.46 0.54
10 0.54 0.46
11 0.49 0.51
12 0.59 0.41
13 0.54 0.46
14 0.55 0.45
15 0.58 0.42
16 0.51 0.49
17 0.48 0.52
18 0.55 0.45
19 0.48 0.52
In [ ]:
# - matplotlib figure size
plt.rcParams['figure.figsize'] = [10, 4]

stat_exp_1.hist(bins=100)

sns.despine(offset=10, trim=True)

Ok, now let' see what happens with 10,000 statistical experiments with a fair coin:

In [ ]:
# - matplotlib figure size
plt.rcParams['figure.figsize'] = [10, 4]

outcomes = np.array([0,1])
num_exps = 10_000
stat_exp_1 = []

for i in range(num_exps):
    tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])
    result = np.array([(tosses==0).sum(), (tosses==1).sum()])/tosses.size
    result = {'P_T':result[0], 'P_H':result[1]}
    stat_exp_1.append(result)

stat_exp_1 = pd.DataFrame(stat_exp_1)
stat_exp_1.hist(bins=100)

sns.despine(offset=10, trim=True)

What about 100,000 experiments?

In [ ]:
# - matplotlib figure size
plt.rcParams['figure.figsize'] = [10, 4]

outcomes = np.array([0,1])
num_exps = 100_000
stat_exp_1 = []

for i in range(num_exps):
    tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])
    result = np.array([(tosses==0).sum(), (tosses==1).sum()])/tosses.size
    result = {'P_T':result[0], 'P_H':result[1]}
    stat_exp_1.append(result)
    
stat_exp_1 = pd.DataFrame(stat_exp_1)
stat_exp_1.hist(bins=100)

sns.despine(offset=10, trim=True)

Theoretical probability: when we know that some stochastic system produces an outcome with some particular probability.

Experimental probability: when we estimate that some stochastic system produces an outcome with some particular probability from observed data.

In [ ]:
stat_exp_1['P_H'].mean()
Out[ ]:
0.49986010000000003
In [ ]:
stat_exp_1['P_T'].mean()
Out[ ]:
0.5001399000000001

Reverse engineering the coin¶

In all my previous statistical experiments, I have assumed that the coin is fair, e.g. P(H) = P(T) = .5, look:

tosses = np.random.choice(outcomes, 100, replace=True, p=[.5, .5])

I have used numerical simulations of coin tosses to demonstrate that with increasing number of experiments I get to estimate the known probabilities of P(H) and P(T) better and better.

However, we typically do not know the theoretical probability of a random event... Example: the CTR of some social media post.

CTR (Click-Through Rate) is like a coin toss: a post is clicked or not clicked upon each served impression. We can thus threat the CTR like a coin toss: P(H) = P(Clicked) is "post clicked", P(T) = P(Not clicked) is "not clicked". Then we can say that P(Clicked) is some latent, essential characteristic of post.

Imagine we have a set of observations for a particular social media post: 1,1,1,0,0,0,1,0,1,0,1,0,0,0,1,0.., where 1 = Clicked and 0 = Not clicked. How do we estimate the CTR of the post from these data?

Easy, right: sum up the vector of observations and divide by its length. But wait. There's more to it.

I have just invented some social media post with a CTR of .3, and estimated its experimental probability to be .3035; that is the best estimate that I currently have. What if I want to test a set of hypotheses about the CTR of this post? What, for example, if I am intersted to learn about the probability of this post having a CTR of .25 instead?

Let's use a simpler, smaller set of observations to begin with.

Imagine we toss a fair coin with P(H)=.5 twice and observe two Heads. The probability of observing two heads with P(H)=.5 is:

.5 * .5 = .25

But what if the coin was not fair, and instead had a probability of P(H)=.3? Then:

.3 * .3 = 0.09

And what if the coin had P(H) = .7?

.7 * .7 = 0.49

With P(H) = .9 we get:

.9 * .9 = 0.81

and it seems intuitive and logical: if we have observed a sample - however small it was - of all heads (and are assuming that we have observed two Heads from two tosses here), it makes sense that we are dealing with a coin that almost always lands Heads.

Let's express this intuition visually. First, we need a function to compute the probability that a coin with a certain P(H) produced the data.

In [ ]:
outcomes = np.array([0,1])
clicks = np.random.choice(outcomes, 10000, replace=True, p=[.7, .3])
ctr = np.array([(clicks==1).sum()])/clicks.size
print(ctr)
[0.3022]

Let's define a helper function for coin tossing. It is going to give us probability of observation, based on probability of getting Heads. If we say $P(H)=.2$ then each time we get Heads, function returns 0.2.

In [ ]:
def parameter(observation, heads_probability):

    if observation==1:
        return heads_probability
    
    elif observation==0:
        return 1 - heads_probability
    
    else:
        return np.nan

# we are vectorizing the function
pars = np.vectorize(parameter)

# here is our sample of coin toss results
data = np.array([1, 1, 0])

# these are the probabilities of each coin toss result
print(pars(data, .25))
[0.25 0.25 0.75]

Remember that we have defined likelihood as multiplication of each observation's probability? Here we define function do to just that.

In [ ]:
def likelihood(x):
    return np.prod(x)

print(likelihood(pars(data, .25)))
0.046875

Nice. Now, let's define our parameter's domain. Remember, we are trying to find maximum likelihood for probability of getting Heads ($P(H)$). Since this is a probability, domain is range $[0, 1]$.

In [ ]:
test_parameters = np.linspace(0, 1, 1000)

# first 100 values of P(H)
print(test_parameters[0:100])
[0.         0.001001   0.002002   0.003003   0.004004   0.00500501
 0.00600601 0.00700701 0.00800801 0.00900901 0.01001001 0.01101101
 0.01201201 0.01301301 0.01401401 0.01501502 0.01601602 0.01701702
 0.01801802 0.01901902 0.02002002 0.02102102 0.02202202 0.02302302
 0.02402402 0.02502503 0.02602603 0.02702703 0.02802803 0.02902903
 0.03003003 0.03103103 0.03203203 0.03303303 0.03403403 0.03503504
 0.03603604 0.03703704 0.03803804 0.03903904 0.04004004 0.04104104
 0.04204204 0.04304304 0.04404404 0.04504505 0.04604605 0.04704705
 0.04804805 0.04904905 0.05005005 0.05105105 0.05205205 0.05305305
 0.05405405 0.05505506 0.05605606 0.05705706 0.05805806 0.05905906
 0.06006006 0.06106106 0.06206206 0.06306306 0.06406406 0.06506507
 0.06606607 0.06706707 0.06806807 0.06906907 0.07007007 0.07107107
 0.07207207 0.07307307 0.07407407 0.07507508 0.07607608 0.07707708
 0.07807808 0.07907908 0.08008008 0.08108108 0.08208208 0.08308308
 0.08408408 0.08508509 0.08608609 0.08708709 0.08808809 0.08908909
 0.09009009 0.09109109 0.09209209 0.09309309 0.09409409 0.0950951
 0.0960961  0.0970971  0.0980981  0.0990991 ]
In [ ]:
# - setup
likelihood_frame = []
outcomes = np.array([0,1])
data = np.random.choice(outcomes, 1000, replace=True, p=[.7, .3])

# - iterate
for pr in test_parameters:
    d = likelihood(pars(data, pr))
    d = {'P_H':pr, 'Likelihood':d}
    likelihood_frame.append(d)

# - compose
likelihood_frame = pd.DataFrame(likelihood_frame)

# - inspect
likelihood_frame.head(10)
Out[ ]:
P_H Likelihood
0 0.000000 0.0
1 0.001001 0.0
2 0.002002 0.0
3 0.003003 0.0
4 0.004004 0.0
5 0.005005 0.0
6 0.006006 0.0
7 0.007007 0.0
8 0.008008 0.0
9 0.009009 0.0
In [ ]:
plt.rcParams['figure.figsize'] = [6, 5]

likelihood_frame.plot(x='P_H', y='Likelihood');

sns.despine(offset=10, trim=True)

Let's ask explicitly:

In [ ]:
maximum_likelihood = likelihood_frame.idxmax()[1]

likelihood_frame['P_H'][maximum_likelihood]
Out[ ]:
0.3023023023023023

And what if P(H) was .85?

In [ ]:
# - setup
likelihood_frame = []
outcomes = np.array([0,1])
data = np.random.choice(outcomes, 1000, replace=True, p=[.15, .85])

# - iterate
for pr in test_parameters:
    d = likelihood(pars(data, pr))
    d = {'P_H': pr, 'Likelihood':d}
    likelihood_frame.append(d)

# - compose
likelihood_frame = pd.DataFrame(likelihood_frame)

# - inspect
likelihood_frame.head(10)
Out[ ]:
P_H Likelihood
0 0.000000 0.0
1 0.001001 0.0
2 0.002002 0.0
3 0.003003 0.0
4 0.004004 0.0
5 0.005005 0.0
6 0.006006 0.0
7 0.007007 0.0
8 0.008008 0.0
9 0.009009 0.0
In [ ]:
maximum_likelihood = likelihood_frame.idxmax()[1]

likelihood_frame['P_H'][maximum_likelihood]
Out[ ]:
0.8438438438438438

Pick any arbitrary observation

In [ ]:
# - setup
likelihood_frame = []
data = np.array([1,1,0,1,1,1,1,1,0,0,0,0,1,1,0,1,1])

# - iterate
for pr in test_parameters:
    d = likelihood(pars(data, pr))
    d = {'P_H': pr, 'Likelihood':d}
    likelihood_frame.append(d)

# - compose
likelihood_frame = pd.DataFrame(likelihood_frame)

# - inspect
likelihood_frame.head(10)
Out[ ]:
P_H Likelihood
0 0.000000 0.000000e+00
1 0.001001 1.005009e-33
2 0.002002 2.045915e-30
3 0.003003 1.759043e-28
4 0.004004 4.139855e-27
5 0.005005 4.790436e-26
6 0.006006 3.537903e-25
7 0.007007 1.916616e-24
8 0.008008 8.275962e-24
9 0.009009 3.005145e-23
In [ ]:
# - matplotlib figure size
plt.rcParams['figure.figsize'] = [6, 5]

# - plot
likelihood_frame.plot(x='P_H', y='Likelihood');

sns.despine(offset=10, trim=True)
In [ ]:
maximum_likelihood = likelihood_frame.idxmax()[1]
likelihood_frame['P_H'][maximum_likelihood]
Out[ ]:
0.6466466466466466
In [ ]:
np.unique(data, return_counts=True)[1]
Out[ ]:
array([ 6, 11])

Good. What have we learned?

To estimate the parameter of a simple statistical model from data - a coin, in our case, a simple stochastic system with two outcome states only - we need to find the maximum of its likelihood function given the observed data.

For our example with the HH minimal observations - tossing a coin two times and observing two Heads in a row - the likelihood function is:

In [ ]:
# - setup
likelihood_frame = []
data = np.array([1,1])

# - iterate
for pr in test_parameters:
    d = likelihood(pars(data, pr))
    d = {'P_H': pr, 'Likelihood':d}
    likelihood_frame.append(d)

# - compose
likelihood_frame = pd.DataFrame(likelihood_frame)

# - inspect
likelihood_frame.head(10)

# - matplotlib figure size
plt.rcParams['figure.figsize'] = [6, 5]

# - plot
likelihood_frame.plot(x='P_H', y='Likelihood');

sns.despine(offset=10, trim=True)

Of course: however small the sample size, the sample represents just everything that we know about a coin; if we observe all heads, like in: HH, then the most probable coin that could have produced the result is definitely the one that yields Heads all the time!

2.1 MLE in Linear Regression¶

In [ ]:
iris = pd.read_csv(os.path.join(data_dir, 'iris.csv'))
iris.head(10)
Out[ ]:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
In [ ]:
linear_model = smf.ols(formula='petal_length ~ sepal_length', data=iris).fit()
linear_model.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: petal_length R-squared: 0.760
Model: OLS Adj. R-squared: 0.758
Method: Least Squares F-statistic: 468.6
Date: Tue, 25 Apr 2023 Prob (F-statistic): 1.04e-47
Time: 16:54:32 Log-Likelihood: -190.49
No. Observations: 150 AIC: 385.0
Df Residuals: 148 BIC: 391.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -7.0954 0.506 -14.011 0.000 -8.096 -6.095
sepal_length 1.8575 0.086 21.646 0.000 1.688 2.027
Omnibus: 0.255 Durbin-Watson: 1.204
Prob(Omnibus): 0.880 Jarque-Bera (JB): 0.384
Skew: -0.084 Prob(JB): 0.825
Kurtosis: 2.817 Cond. No. 43.4


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]:
iris['Prediction'] = linear_model.predict()
iris['Residuals'] = linear_model.resid
iris.head(10)
Out[ ]:
sepal_length sepal_width petal_length petal_width class Prediction Residuals
0 5.1 3.5 1.4 0.2 setosa 2.377918 -0.977918
1 4.9 3.0 1.4 0.2 setosa 2.006416 -0.606416
2 4.7 3.2 1.3 0.2 setosa 1.634914 -0.334914
3 4.6 3.1 1.5 0.2 setosa 1.449163 0.050837
4 5.0 3.6 1.4 0.2 setosa 2.192167 -0.792167
5 5.4 3.9 1.7 0.4 setosa 2.935171 -1.235171
6 4.6 3.4 1.4 0.3 setosa 1.449163 -0.049163
7 5.0 3.4 1.5 0.2 setosa 2.192167 -0.692167
8 4.4 2.9 1.4 0.2 setosa 1.077661 0.322339
9 4.9 3.1 1.5 0.1 setosa 2.006416 -0.506416
In [ ]:
# - plotting the true data, predicted values and the prediction line
sns.regplot(data=iris, x='sepal_length', y='petal_length', ci=0, line_kws={'color':'red'})
sns.scatterplot(data=iris, x='sepal_length', y='Prediction', color='red', s=50)
sns.despine(offset=10, trim=True)
plt.title('petal_length ~ sepal_length', fontsize=14);
In [ ]:
print("Pearson's correlation (R-value): " + str(round(np.sqrt(linear_model.rsquared), 2)))
print("Coefficient of determination (R^2): " + str(round(linear_model.rsquared, 2)))
Pearson's correlation (R-value): 0.87
Coefficient of determination (R^2): 0.76
In [ ]:
# - normal distribution
from scipy.stats import norm
from scipy.optimize import minimize

# - nll function
def lg_nll(pars, data):
    # - pick up the parameters
    beta_0 = pars[0]
    beta_1 = pars[1]
    sigma = np.abs(pars[2])
    # - predict from parameters
    preds = beta_0+beta_1*data['sepal_length']
    # - compute residuals
    residuals = data['petal_length']-preds
    # - negative log-likelihood
    nll = -np.sum(norm.logpdf(residuals, loc=0, scale=sigma))
    # - out:
    return nll

# - initial (random) parameter values
init_beta_0 = np.random.uniform(low=-15, high=15, size=1)
init_beta_1 = np.random.uniform(low=-15, high=15, size=1)
init_sigma = np.random.uniform(low=.1, high=10, size=1)
init_pars = [init_beta_0, init_beta_1, init_sigma]

# - optimize w. Nelder-Mead
optimal_model = minimize(
    # - fun(parameters, args)
    fun=lg_nll,
    args = (iris), 
    x0 = init_pars, 
    method='Nelder-Mead',
    options={'maxiter':10e6, 
            'maxfev':10e6,
            'fatol':10e-12})

# - optimal parameters
optimal_model.x
/tmp/ipykernel_652339/1485672469.py:27: DeprecationWarning: Use of `minimize` with `x0.ndim != 1` is deprecated. Currently, singleton dimensions will be removed from `x0`, but an error will be raised in SciPy 1.11.0.
  optimal_model = minimize(
Out[ ]:
array([-7.0953817 ,  1.8575097 ,  0.86157998])
In [ ]:
# Check against `statsmodels`
linear_model.params
Out[ ]:
Intercept      -7.095381
sepal_length    1.857510
dtype: float64
In [ ]:
# Log-Likelihood
loglike = -lg_nll(optimal_model.x, data=iris)
print(loglike)
-190.49268265220317
In [ ]:
linear_model.llf
Out[ ]:
-190.49268265220212
In [ ]:
beta_0_vals = np.linspace(-10,0,300)
beta_1_vals = np.linspace(0,3,300)

sigma = optimal_model.x[2]

grid = np.array([(beta_0, beta_1) for beta_0 in beta_0_vals for beta_1 in beta_1_vals])

grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_0", 1: "beta_1"})

nll = []

for i in range(grid.shape[0]):
    pars = [grid['beta_0'][i],grid['beta_1'][i],sigma]
    nll.append(lg_nll(pars, iris))

grid['likelihood'] = nll
grid['likelihood'] = np.exp(-grid['likelihood'])
grid.sort_values('likelihood', ascending=False, inplace=True)
grid.head()
Out[ ]:
beta_0 beta_1 likelihood
26285 -7.090301 1.856187 1.860885e-83
26884 -7.023411 1.846154 1.840102e-83
25686 -7.157191 1.866221 1.830589e-83
25387 -7.190635 1.876254 1.780881e-83
27483 -6.956522 1.836120 1.769939e-83
In [ ]:
# - import plotly
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
    x=grid['beta_0'], 
    y=grid['beta_1'], 
    z=grid['likelihood'], 
    color='red', 
    opacity=0.50)])
fig.update_layout(scene = dict(
                    xaxis_title='Beta_0',
                    yaxis_title='Beta_1',
                    zaxis_title='Likelihood'),
                    width=700,
                    margin=dict(r=20, b=10, l=10, t=10))
fig.show()

Further Reading¶

  • StatQuest w. Josh Starmer, Regularization Part 1: Ridge (L2) Regression
  • StatQuest w. Josh Starmer, Regularization Part 2: LASSO (L1) Regression
  • StatQuest w. Josh Starmer, Regularization Part 2: Elastic Net Regression
  • Khan Academy, Algebra 2, Unit: Logarithms
  • StatQuest w. Josh Starmer, Maximum Likelihood, clearly explained!!!
  • Timothy Schulz, Introduction to Maximum Likelihood Estimation

DataKolektiv, 2022/23.

hello@datakolektiv.com

License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.